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Abstract 

We propose a new algorithm to do posterior sampling of Kingman's coalescent, based upon 
the Particle Markov Chain Monte Carlo methodology. Specifically, the algorithm is an in- 
stantiation of the Particle Gibbs Sampling method, which alternately samples coalescent times 
conditioned on coalescent tree structures, and tree structures conditioned on coalescent times 
via the conditional Sequential Monte Carlo procedure. We implement our algorithm as a C++ 
package, and demonstrate its utility via a parameter estimation task in population genetics on 
both single- and multiple-locus data. The experiment results show that the proposed algorithm 
performs comparable to or better than several well-developed methods. 



1 Introduction 

Data shows hierarchical structure in many domains. For example, computer vision problems often 

In text mining, documents can be 
Teh et al.l 120061. Algorithms that 



Lee et al. 


2009 


Blei et al. 


, 2003 



involve hierarchical representation of images 
modeled as hierarchical generative processes 
can effectively deal with hierarchical structure play an important role in uncovering the intrinsic 
structures of data. 



We focuses on a hierarchical model in population genetics, the Kingman's n-colescent Kingman 
1982a|b , shortly referred to as the coalescent. It is a tree-structured process that models the 
genealogical relationship among a set of individuals, represented by some feature vectors (typically 
the DNA sequences) . Although forward simulation of genealogy tree [Stephen's and Donnelly} |2000| 
is simple, posterior inference is much difficult due to the complicate state space, which requires 
combinatorial search over tree structures and high-dimensional sampling of coalescent time series. 
Many works have been done in this area, and are roughly grouped into three methodologies, i.e., 
Importance Sampling (IS), Markov Chain Monte Carlo (MCMC), and Sequential Monte Carlo 
(SMC). The IS approaches [Griffiths and Tavare 1994a, b Stephens and Donnelly, 2000 simulate 
mutation events explicitly, and require coalescent events happen between identical sequences, so 



aren't efficient for data of highly diverse sequences. The MCMC approach |Kufmer et al. , 1995 



Kuhner, 2006 



uses random structure modification to explore the tree space. Although the samples 
are cheap to simulate computationally, they don't represent the posterior distribution well, which 
makes the whole algorithm not accurate. The SMC approaches [Teh et al 



2008, Goriir and 



1 



Teh, 2009 use the so-called "local posterior" to propose coalescent events & times. The pair- wise 



numerical integration may bring potential scalability problem. But recent improvement using "pair 



similarity" heuristic seems to alleviate the problem in some sense Goriir et al. , 2012 



In this paper, we propose a novel inference algorithm for the coalescent based upon the Particle 
Markov Chain Monte Carlo (PMCMC), a methodology recently developed in statistics community 
[Andrieu et al. 2010| . More specifically, it falls under the framework of the Particle Gibbs Sam- 
pling (PGS), which is a case of PMCMC that targets joint posterior distribution of parameter and 
hidden variable. Our core idea is to decouple coalescent times (tree branching lengths) and tree 
structure, view th e former as the parameter while the later as the hidden variable, and alternately 
sample one conditioned on another. Specifically, we simulate the coalescent times by Gibbs sam- 
pling, and simulate the tree structure by the conditional Sequential Monte Carlo (cSMC). Although 
computationally expensive, this approach explores the state space and generates tree structures in- 
formatively. We demonstrate this by estimating the likelihood surface of 0, a parameter of the 
coalescent which captures the mutation rate and the size of the background population. Experi- 
ments show that our method performs comparable to or better than the well-developed methods 
as mentioned previously. 

The following paper is organized as follows: In Section 2, we introduce the coalescent in the 
context of graphical models and probabilistic inference. In Section 3, we explain the challenging 
problems of coalescent posterior sampling and genetic parameter estimation, and review current 
approaches. In Section 4, we formally describe our algorithm, Particle Gibbs Sampler for the King- 
man's Coalescent (PGS-Coalescent). Experimental results are presented and analysed in Section 
5. We give more discussion and conclude in Section 6. 



2 Kingman's n-Coalescent 



Kingman's coalescent is proposed in population genetics to model the genealogy upon the haploid 
population Kingman, 1982a|b| . Briefly speaking, the coalescent is a tree-structured process that 
starts from the current population, and evolves backward with random mutation and coalescent 
events until the Most Recent Common Ancestor (MRCA) is reached. Besides the successful ap- 
plications in population genetics, it also becomes popular as a hierarchical clustering model in 
machine learning Teh et al. 2008 Goriir and Teh, 2009, Goriir et al. 2012 . The state space 



representation of the coalescent varies among different inference algorithms and applications, but 
falls into two broad categories. One represents the coalescent events and the coalescent times but 
not mutation events or their times [Kuhner et al. 1995 Slatkin, 2002, Goriir et al. 



the other represents the coalescent and mutation events but not their times Griffiths and Tavare 



2012 



while 



1994a|b[ [Stephens and Donnelly} [2000| . In this paper we adopt the former representation. Figure 



T] illustrates the coalescent tree under such representation. Notations are summarized as follows. 

Denote a subset of a current population as X = {x\, x n }, where n is the number of individuals 
in the subset. They compose the leaf nodes of the coalescent. Starting from X, n — 1 coalescent 
events will happen until MRCA is reached. They are represented as the hidden nodes, and fully 
describe the coalescent structure. Further more, these events happen sequentially at time T = 
{ii, i n -i}, with hidden states Y = {y%, y n -i}- For any event i, if applicable, denote its 
waiting time since previous event i — 1 as 5i = — ti, its parent as o(i), its sibling as and its 
children as {ci(i), c r (i)}. Mutation events can happen between any pair of parent- and child-nodes, 
but are represented implicitly as a continuous-time Markov process with transition rate matrix 9R, 
where 6 oc Nfi is a genetic parameter that is controlled by the effective population size N and 
neutral mutation rate per site per generation n iKuhner 20061 . Accordingly, the transition matrix 
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Figure 1: An Example of Coalescent Process 



from the parent to the child is T u\ t i = e e ^ li t °{i)> R . Finally, denote the equilibrium distribution as 
Po, which satisfies PqR = 0. 

A coalescent tree Q is specified by the tree structure and the branching lengths. At each event 
i, any pair of nodes that remain in the population is equal likely to coalesce. The pairing structure 
of the nodes defines the genealogy structure. The prior probability of any genealogy structure 

n ~ l (n — i + 

is p(S) = Y\ ( 2 ) ' w ^ e * ne wa itmg time 5i follows exponential distribution with rate 

n — i + 1 
2 



So the prior probability of the coalescent tree Q = (S, T) is, 



n-l 



p{G) = p{S)p{T) = \\ exp 



i=l 



n 



-i + l 
2 



(1) 



2.1 Belief Propagation in the Coalescent 



If Q and are known, one can solve pg(X\Q) elegantly with belief propagation (BP) Felsenstein 



198lj|PearHp86] . Note that |Teh et al.[ [20081 |Goriir and Teh[ [20091 |Goriir etaTj |2012| apply the 
upward part of BP. The key difference of our algorithm is to do propagation on both directions, 
which allows Gibbs Sampling targeting pg(T\X,S). While algorithmic details will be discussed in 
next section, here we provide a brief description of BP. 

BP in coalescent has two rounds. First, the leaf nodes propagate messages upward until the 
root. Second, the root collects incoming messages and propagates them downward until the leaves. 
After the two-round propagation, one can do inference at any internal node with incoming messages 
to it. Formally, the message from a leaf node i £ {1, 2, n} is, 



1 



m 



>o(i) 



(2) 



0{i) 
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the messages associated with a non-root hidden node i G {n + 1, n + 2, 2n — 2} are, 



— n s ^.^w^ 



m i^o(i) - 7^ II 7 > ^aW^W-**) 

i->-0(i) b=l>r y cb{i) 



™-f.^-.f* = — ^2 T o{i),i m o{i)-^i ^2 T iXc b (i)) m s(c b (i))^i ( 3 ) 
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>Cj,(i) 



and the messages associated to the root node i = 2n — 1 are 

m ^ C6 W = Y d S M^)^^©) 11 ^^))-^ (4) 

*-»<%(i) 2/ s ( Cfe (i)) 

In Equation ([2]), Q and Q, 6 G represents left and right branches of the selected node, 'u' 



and 'd' represent message directions, upward and downward separately. Similar to Teh et al. , 2008 



the normalization constant Z is set such that ^2 v Po( v ) m i v ) = 1- Hence the likelihood function of 
X can be expressed at a non-root node as, 



2n-2 



p g (x\g)= Yl 4 EE T «( ! ).« m »( 1 )^ II E ^^w^w^' ( 5 ) 

i=l>jy* Vi Vo(i) b=l,ry % (i) 

and at root node as, 

2n-2 

P$(X\G) = Yl Z j Po(Wn-l) II T 2n-l,c f) (2n-l)m C()(2n _ 1) ^2n-l (6) 

j=l 2/2n-l b=l,r Vc b (2n-1) 

where the normalization constants {Z^} are of those messages which direct toward node i. The 

subscript g emphasizes the dependency of the probability to mutation rate through {Zj } and 
{T (i),i}- Accordingly, the joint probability of the data and the coalescent pg(X,G) = pe(X\G)p(G) 
can be computed with Equation , and ^ or (|6j) . 

In biological applications X are typically sequence data of multiple sites, for example, allel or 
DNA sequences. Denote the sequence length as L. Assuming each site mutate independently, it is 
easy to show the message m^j and the corresponding normalization constant Z^j are given by 



L 



(0 

in 

L 



1=1 



= n (7) 

i=i 



where mf^j, Z fXj are the messa g e an d the normalization at each locus I G {1, 2, L}. Therefore, 
BP still applies to the multiple-site case. 



3 Sampling and Parameter Estimation of the Coalescent 

In real applications, G and 6 are usually unknown variables which are of interest to geneticists. 
For example, given a sub-population, people may want to estimate the likelihood surface of 9 



4 



|Steph ens and Donnellyl 12000] , which will shed light on the mutation rate and size of the background 
population, 



(8) 



L(e) = / Pe (x,g)dg 

Jg 

Exact inference is intractable, because L{0) involves combinatorial summation of n!(n — l)!/2 



n-l 



possible coalescent structures Stephens and Donnelly 2000| , each of which involves an 0(n 



dimensional integration over coalescent times. Therefore, people usually resort to Monte Carlo 



(MC) methods [Liu 2008] . Researches in this problem roughly belong to three categories, using 
different methodologies and coalescent representations. In the following, we briefly review the three 
categories, and a new methodology which will be the basis of our proposed algorithm. 



3.1 Importance Sampling 

Griffiths and Tavare |1994a|b| pioneered coalescent inference using Importance Sampling. Stephens 



and Donnelly 



2000 



popularized this approach. They represent the coalescent as an ordered list 
of partitions evolving forward, Q = (iT_ m , fl-i, -f^o)- At each generation, a partition is selected, 
and then mutation or split is applied to one of its individual. This process goes until population 
size grows to n + 1. Stephens and Donnelly |2000| exploits the character of pg(Q\X), and invents 



an informative proposal distribution that evolves backward and converge to p$(0\X) in the parent- 
independent mutation (PIM) case. In IS, the marginal likelihood of 8 is approximated by, 
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(9) 



3.2 Markov Chain Monte Carlo 



Kuhner et al. 1995 is the first work to apply MCMC to coalescent inference. Its implementation 



genetic parameters. See Tavare 



approaches. Kuhner et al. 



1995 



is integrated in the LAMARC software Kuhner, 2006 , a very popular analysis tool of population 



2004, Wakeley, 2009 for a review of this work and other MCMC 



represents coalescent events and time (scaled in units of mutation 
rate), and uses a Metropolis-Hastings scheme to explore the genealogy space. The proposal distri- 
bution is very simple: 1) randomly break a non-root node from its parent. 2) randomly coalesce 
the node to another branch with exponentially distributed time. Their MCMC approach computes 
the relative likelihood of 6 as, 



HO) 
L(9 ) 
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(10) 



3.3 Sequential Monte Carlo 



SMC approaches recently published in machine learning [Teh et aL 2008, Goriir and Teh, 2009 



Goriir et al. , 2012 use a similar state space representation to the MCMC approaches. They start 



from the current population, and iteratively select coalescent pairs and times based upon the 
so-called "local posterior" distribution. Teh et al. (2008 1 introduces three proposal distributions 
from the local posterior, SMC-Prior, SMC-PriorPost, and SMC-PostPost. In particular, PostPost 
particles are of the highest quality, because it samples exactly from the local posterior. However, 
it is computationally intensive with complexity 0(n 3 ) per particle due to the pair-wise integration 
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at each coalescent event. The scalability problem is addressed in Goriir and Teh, 2009 and |Goriir 
et al. , 2012 . The general form of the likelihood approximation for SMC takes the following form 



M 



L{6) « £ 



w 



(0 



(ii) 



where wry is the particle weights. 



3.4 Particle MCMC 



PMCMC |Andrieu et"alj|2010| can bee viewed as a integration of MCMC and SMC (or IS) methods, 
where each MCMC iteration samples SMC particles to build the proposal distribution. As the SMC 
particles are an empirical approximation of the posterior distribution, the proposal move can be 
more effective, and less apt to being stuck in local minima. This methodology has not been applied 
to coalescent parameter inference before. We detail our contribution in this aspect in the next 
sections. 



4 Towards a New Inference Algorithm 

We now propose a novel approach to sample from the posterior pg (5, T\X) based upon the Particle 
Markov Chain Monte Carlo method. The core idea is to decouple tree structure and coalescent 
times, and alternately sample each of them conditioned on the other, via the Particle Gibbs Sampler 
(PGS) [Andrieu et al. 2010] . We hope this treatment will build proposal distributions which are 
close to p(S, T\X, 9q), and makes the sampling efficient in terms of particle quality. 

First, we use Gibbs sampling to sample coalescent times from p(T\X, S k ). Posterior probability 
of U conditioned on the rest of the configuration of the genealogy is, 

p(U\X, S, T-i) oc p(X\S, T-i, U)p(S, T-i,U) (12) 

The right hand side of Equation (12) is a product of and ^ or Q, which only depends on t{ 
at a few terms. More specifically, for non-root hidden notes, 

p(ti\X,S,T-i) ocexpj- y ^ M (U -t i+ i)\ -expj- ^ n * + ^ (*t-i _ <i) 

" XT T o(i),i m o(i)^i II Y T iMi) m c b (i)^i, (13) 
Di Vo{i) b=l,rVc b (i) 

where max(t i+ i, t ^) < U < min(^_i, i C;(i) , t Cr ^). For root note, 

p(ti\X,S,T-i) ocexp i - ( n ^ + 1 J (U-i ~ k) \ ■ YPoiVi) II ^ r *,c6(i) m £%(i)->t, ( i4 ) 

where — oo < U < min(tj_i, t Cl (j), ^(i))- {U,i = 1,2, ...,n — 1} are sampled from above equations 
iteratively. Note that at each iteration, the node with updated time should propagate messages 
outward to the rest of the whole tree, so that other nodes collect updated messages when their 
coalescent times are to be sampled. 

Second, we use conditional SMC to sample the structure from p(S\X, T k+1 ). S corresponds 



to an ordered list of coalescent events (si, s n -i)- We adopt the "local posterior" idea Teh 



6 



et al. 2008| to sample the coalescent pair (see Eq.(lO) in Teh et al. 2008 for example). But the 



fundamental difference is, we sample the topology conditioned on the coalescent times, so avoid the 
costly computation of pairwise integration. Actually, as <5j is fixed, we only have to evaluate local 
posteriors at the given coalescent times, 



q(pi,p r \X,U) oc Z^ o{i) (pi,p r ,U) 



(15) 



for all existing pairs, and sample the tree structure from a series of multinomial distributions 
composed of these local posteriors. The normalization constants are, 



^2 Zl^ o(i) {p h p r ,ti),i < n - 2 



Lr 



w 



n-1 — /^Poiy) ^ T2n-l,c b (2n-l) m c b (2n-l)^2n-l 

y b=l,r Vc h 

Suppose SMC generates M particles, then the posterior is approximated by, 

M 



(16) 



p(S\X,T) = Y,W^5 s(m) (S), 



(17) 



m=l 



where, 



W (m) K P(SM,T,X) p(X\SW,T) 
q (SM\X,T) q(S( m )\X,T) 



(18) 



The second ratio holds because is constant regardless of 5( m ) by Eq.Q. Moreover, the 

numerator p(X\S^ m \T) is computed with Eq.Q. For Sequential Importance Sampling (SIS) (a 
special case of SMC without resampling), the weights are, 



n-1 



w(m) n 



w 



(19) 



8=1 



using Equation Q, (15) and (16). For SMC described by Andrieu et al. [2010 , the resampling step 
at each iteration i implicitly counts for the weight Wj_ i, so the particle weight is just the weight 
of the last iteration, i.e., 



P(S{ZI V T,X) 



oc w 



(m) 
n-1 



(20) 



P{^l-,n-2i T, X)q(S ( n _\\sl.J_ 2 ) 

To this point, we specify our SMC procedure to build a particle approximation of the conditional 
distribution p(S\X, T). As noted by Andrieu et al. [2010 , we keep the structure from previous PGS 
iteration as one current particle to follow the legitimate cSMC requirement. Finally, we sample the 
new structure S k+1 from the set of weighted cSMC particles. 

Our Particle Gibbs Sampling algorithm for Kingmans' n-Coalescent (PGS-Coalescent) is de- 
scribed in Algorithm [TJ It computes the relative likelihood of 9 as, 



L{e) 



i M 

M 5^ 



p(x\gw,e) 



L(o ) M^ iP (x\g(i),e ) 



(21) 
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Algorithm 1 Particle Gibbs Sampler for the Kingman's Coalescent (PGS-Coalescent, abbr. 
PGSC) 

Initialize (5, T) = (5°, 7"°) randomly, where S is genealogy structure, and T = ti, t n _i is time 

of coalescent events. 

repeat 

1) sample T k+l ~ p(T\X,S k ) using Gibbs sampling with Equation (|13|) , (|14|) , update messages 
with Equation ([3]), Q after each tj is changed 

2) run a conditional SMC algorithm targeting p(S\X, T k+1 ) with proposal distribution (15), 
and get a set of weighted particles {S^ 

3) sample S k+1 
until Equilibrium 



p(S\X,T k+1 ) with Equation (pi) or d20 ) 



,5^} which include 5 fc 



5 Applications 



We compare the performance of PGS-Coalescent with the IS, MCMC and SMC based methods in 
this section. We simulate coalescent samples and use Equation (JlOj), ([llf and ^ to evaluate 
the likelihood surface of 6. Our major concern is particles' quality. Specifically, good approaches 
should use fewer samples to reconstruct more accurate likelihood surfaces. We also monitor CPU 
time, but only for the purpose of evaluating scalability, as different algorithms are implemented 
in various languages. Experiments are performed on an Intel i3 PC. To make the time scalability 
analysis reliable, we run each algorithm without any parallelization. 



5.1 Binary Alleles 



The dataset is from Griffiths and Tavare 1994b . It has 50 individuals of 20 sites. Each site has 



binary alleles. The unit mutation rate matrix is, 



R 



-0.5 
0.5 



0.5 
-0.5 



(22) 



We run IS, MCMC, SMC1 and PGSC on this dataset randomly for five times. The (relative) 
likelihood surface is shown in Figure [2j We can see MCMC method and SMC method performs 
not good in this dataset. The MCMC method didn't capture the mode of likelihood surface. The 
SMC method can get a rough shape of the surface, but the magnitude varies a lot. PGSC generate 
comparable results to IS, the state-of-the-art, with much fewer samples. 



5.2 Microsattelites 



Next, we consider the microsatellite data used by Stephens and Donnelly 2000 , with state space 
{0, 1, 19}, under stepwise mutation model, where each state can only mutate to one of its neigh- 
bors with equal probability. 



5.2.1 One-locus case 

The data is {8, 11, 11, 11, 11, 12, 12, 12, 12, 13}. We run IS, MCMC, SMC1 and PGSC on it randomly 
for five times. The (relative) likelihood surface is shown in Figure [3| In this simple scenario of 
one-site and ten-individual, IS, SMC1 and PGSC methods perform well, while SMC1 is slightly 
better than the other two. MCMC methods still cannot capture the accurate shape of the likelihood 



S 



(a) 



(b) 




(c) 



(d) 



Figure 2: (Relative) likelihood surface of sequence data from [Griffiths and Tavare 1994b by 
different methods, each run randomly for 5 times, (a) 10000 samples of IS method; (b) 1,000,000 
iterations of MCMC method, first 50,000 iterations discarded, the rest thinned at interval of 200, so 
4750 valid samples; (c) 50000 samples of SMC1 method; (d) 2000 iterations of PGSC, each with 200 
cSMC particles and 50 Gibbs sampling rounds, discard first 800 iterations, no thinning, yielding 
1200 valid samples 



surface. Again we note that PGSC uses fewest samples to achieve comparable results to SMC1 and 
IS. 

5.2.2 Five-loci case 

The dataset consists of 60 5-sites individuals. We run IS, MCMC, SMC1 and PGSC on it randomly 
for four times. The (relative) likelihood surface is shown in Figure |4j This is a challenging example, 
and all of these methods show some drawbacks. The IS method can capture the shape and mode 
well but not the magnitude, while SMC1 only get a rough shape. As before, MCMC still cannot 
capture the mode. PGSC is performing comparable to IS, but the diversity of particles becomes 
poor. 
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(a) 



(b) 





(c) 



(d) 



Figure 3: (Relative) likelihood surface of one-locus microsatellite data from Stephens and Donnelly 



2000 



by different methods, each run randomly for 5 times, (a) 10000 samples of IS method; (b) 
1,000,000 iterations of MCMC method, first 50,000 iterations discarded, the rest thinned at interval 
of 200, so 4750 valid samples; (c) 10000 samples of SMC1 method; (d) 1000 iterations of PGSC, each 
with 40 cSMC particles and 10 Gibbs sampling rounds, discard first 400 iterations, no thinning, 
yielding 600 valid samples 



5.3 Scalability evaluation 

In previous experiments, the problem scale increases from microsatellite one- locus data, to binary 
allele data, and then microsatellite five-locus data. The CPU time to generate results shown in 
Figure [2j [3] and |4] is recorded in Table [I] (averaged over the several random runs for each method). 
We can see the MCMC method is least sensitive to problem scale. This is expected because its 
proposal distribution is simplest among all the methods. Comparing the two best approches in 
terms of accuracy, i.e., IS and PGSC, one can see IS is much faster than PGSC in all cases. 
However, IS slows down quickly when the data size and diversity grows up, and PGSC looks better 
in terms of scalability. 
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Figure 4: (Relative) likelihood surface of five-locus micro-satellite data from Stephens and Don 



nelly [2000] by different methods, each run randomly for 4 times, (a) 500,000 samples of IS method; 



(b) 1,000,000 iterations of MCMC method, first 50,000 iterations discarded, the rest thinned at in- 
terval of 200, so 4750 valid samples; (c) 10000 samples of SMC1 method; (d) 2000 iterations of 
PGSC, each with 200 cSMC particles and 60 Gibbs sampling rounds, discard first 1200 iterations, 
no thinning, yielding 160 valid samples 

Table 1: CPU time (in seconds) of different algorithms on previous experiments 



Method 


One-locus microsatellite data 


Binary allele data 


Five-loci microsatellite data 


IS 


8 


65 


15300 


MCMC 


13450 


56000 


76100 


SMC1 


1693.3 


NA 


43200 


PGSC 


905 


13050 


71000 



6 Discussion and Conclusion 

Algorithm [T] includes Gibbs Sampling and cSMC as two sub-routines during each iteration. Rigor- 
ously, its complexity depends on three factors. 1) m-j-, number of iterations to reach equilibrium of 
p(T\X, S); 2) ms, particle capacity that moderately models p(S\X, 7~); 3) mg, number of iterations 
of PGS to reach equilibrium of p(Q\X). Taking BP and structural sampling of Equation (15) into 



11 



consideration, the complexity is 0(mg(m-j-n 2 + m l sn 3 )). Although computationally intensive, we 
observed that the algorithm performs reasonably efficient in practice, thanks to two advantages. 
First, cSMC provides a good approximation of p(S\X, T). Second, mq-^ms can be set small but 
still generate high quality samples. These can be seen from previous experimental performance. 

In essential, PGSC is an MCMC approach, but our algorithm is fundamentally different from 
the Metropolis-Hastings based method [Kuhner et al. 1995, Wakeley 



2009 , where structures and 



times are updated independently, and the rearrangement of the tree structure is random. In terms 



of state space representation, we are very close to the SMC approaches Teh et al. , 2008 , Goriir 



and Teh, 2009, Goriir et al., 2012 



However, SMC is a sub-routine in our framework, and the 
conditional sampling S\T avoids intensive pair- wise integration, which the previous works have to 
deal with. Our approach is also fundamentally different to the IS method |Stephens and Donnelly 



2000| in both state-space representation and methodology. The IS method explicitly simulates each 
mutation event, and coalescent events only happen between identical individuals. This could bring 
computational trouble when the data is of high dimension and very diverse. We already observe 
this when studying the five- locus microsatellite dataset. 

To conclude, this paper has introduced a novel inference algorithm (PGS-Coalescent) for King- 
man's n-Coalescent based upon the Particle MCMC methodology. It alternately samples tree 
structures and coalescent times conditioned on one another. We illustrate the utility of our algo- 
rithm by a parameter estimation task in population genetics (6). Experimental results show that 
the proposed method performs comparable to or better than several well-developed approaches. 
PGS-Coalescent could be further optimized by designing other efficient proposal distributions and 
reusing particles, which will help to improve conditional SMC and to reduce computational inten- 
sity. 
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